There's still no connectivity to Facebook's DNS servers:
> traceroute a.ns.facebook.com
traceroute to a.ns.facebook.com (129.134.30.12), 30 hops max, 60 byte packets
1 dsldevice.attlocal.net (192.168.1.254) 0.484 ms 0.474 ms 0.422 ms
2 107-131-124-1.lightspeed.sntcca.sbcglobal.net (107.131.124.1) 1.592 ms 1.657 ms 1.607 ms
3 71.148.149.196 (71.148.149.196) 1.676 ms 1.697 ms 1.705 ms
4 12.242.105.110 (12.242.105.110) 11.446 ms 11.482 ms 11.328 ms
5 12.122.163.34 (12.122.163.34) 7.641 ms 7.668 ms 11.438 ms
6 cr83.sj2ca.ip.att.net (12.122.158.9) 4.025 ms 3.368 ms 3.394 ms
7 * * *
...
So they're hours into this outage and still haven't re-established connectivity to their own DNS servers.
"facebook.com" is registered with "registrarsafe.com" as registrar. "registrarsafe.com" is unreachable because it's using Facebook's DNS servers and is probably a unit of Facebook. "registrarsafe.com" itself is registered with "registrarsafe.com".
I'm not sure of all the implications of those circular dependencies, but it probably makes it harder to get things back up if the whole chain goes down. That's also probably why we're seeing the domain "facebook.com" for sale on domain sites. The registrar that would normally provide the ownership info is down.
Anyway, until "a.ns.facebook.com" starts working again, Facebook is dead.
"registrarsafe.com" is back up. It is, indeed, Facebook's very own registrar for Facebook's own domains. "RegistrarSEC, LLC and RegistrarSafe, LLC are ICANN-accredited registrars formed in Delaware and are wholly-owned subsidiaries of Facebook, Inc. We are not accepting retail domain name registrations." Their address is Facebook HQ in Menlo Park.
That's what you have to do to really own a domain.
Out of curiosity, I looked up how much it costs to become a registrar. Based on the ICANN site, it is $4,000 USD per year, plus variable fees and transaction fees ($0.18 per transaction per year). Does anyone have experience or insight into running a domain registrar? Curious what it would entail (aside from typical SRE-type stuff).
Wow, I had no idea it was so cheap[1] once you're a registrar. The implication is that anyone who wants to be a domain-squatting tycoon should become a registrar. For an annual cost of a few thousand dollars plus $0.18 per domain name registered, you can sit on top of hundreds of thousands of domain names. Locking up one million domain names would cost you only $180,000 a year. Anytime someone searched for an unregistered domain name on your site, you could immediately register it to yourself for $0.18, take it off the market, and offer to sell it to the buyer at a much-inflated price. Does ICANN have rules against this? Surely this is being done?
[1] "Transaction-based fees - these fees are assessed on each annual increment of an add, renew or a transfer transaction that has survived a related add or auto-renew grace period. This fee will be billed at USD 0.18 per transaction." as quoted from https://www.icann.org/en/system/files/files/registrar-billin...
Personally saw this kind of thing as early as 2001.
Never search for free domains on the registrar site unless you are going to register it immediately. Even whois queries can trigger this kind of thing, although that mostly happens on obscure gTLD/ccTLD registries which have a single registrar for the whole TLD.
I can sadly attest to this behavior as recently as a couple years ago :(
I searched for a domain that I couldn't immediately grab (one of more expensive kind) using a random free whois site... and when I revisited the domain several weeks later it was gone :'(
Emailed the site's new owner D: but fairly predictably got no reply.
Lesson learned, and thankfully on a domain that wasn't the absolute end of the world.
I now exclusively do all my queries via the WHOIS protocol directly. Welp.
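For anyone curious, talking WHOIS directly is tiny: it's just a TCP connection to port 43, the query, and a CRLF (RFC 3912). A minimal sketch in Python, using the .com registry's WHOIS server as an example:

    import socket

    def whois(domain, server="whois.verisign-grs.com"):
        """Query a WHOIS server directly over TCP port 43 (RFC 3912)."""
        with socket.create_connection((server, 43), timeout=10) as s:
            s.sendall((domain + "\r\n").encode())
            chunks = []
            while chunk := s.recv(4096):
                chunks.append(chunk)
        return b"".join(chunks).decode(errors="replace")

    print(whois("facebook.com"))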
Probably every major retail registrar was rumored to do this at some point. Add to your calculation that even some heavyweights like GoDaddy (IIRC) tend to run ads on domains that don't have IPs specified.
Network Solutions definitely did it. I searched for a few domains along the lines of "network-solutions-is-a-scam.com", and watched them come up in WHOIS and DNS.
This is not completely accurate. The whole reason a registrar with domain abc.com can use ns1.abc.com is that glue records are established at the registry; this provides a bootstrap that keeps you out of the circular dependency. All that said, it's usually a bad idea. Someone as large as Facebook should have nameservers across zones, i.e. a.ns.fb.com, b.ns.fb.org, c.ns.fb.co, etc.
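You can see those glue records by asking a .com registry server directly. A minimal sketch using the dnspython library (assuming it's installed): the registry hands back the NS names for facebook.com in the authority section plus the glue A/AAAA records that break the chicken-and-egg.

    import socket
    import dns.flags
    import dns.message
    import dns.query

    # Ask one of the .com registry servers for the facebook.com delegation.
    gtld_ip = socket.gethostbyname("a.gtld-servers.net")
    query = dns.message.make_query("facebook.com", "NS")
    query.flags &= ~dns.flags.RD      # we want the registry's referral, not recursion
    response = dns.query.tcp(query, gtld_ip, timeout=10)

    print(response.authority)         # NS records delegating facebook.com
    print(response.additional)        # glue: IP addresses for a.ns.facebook.com, etc.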
There is always a step that involves emailing the domain's contact when a domain updates its information with the registrar. In this case, facebook.com and registrarsafe.com are managed by the same NS. You need those NS to query the MX records to send that update-approval email and unblock the registrar update. Glue records are more for performance than for breaking that loop. I may be missing something, but hopefully they won't need to send an email to fix this issue.
I have literally never once received an email to confirm a domain change. Perhaps the only exception is on a transfer to another registrar (though I can't recall that occurring, either).
To be fair, we did have to get an email from EURid recently for a transfer auth code, but that was only because our registrar was not willing to provide it.
In any case, no, they will not need to send an email to fix this issue.
I just changed the email address on all my domains. My inbox got flooded with emails across three different domain vendors. If they didn't do it before, they sure are doing it now.
This is not true when you're the registrar (as in this case). In fact, your entire system could be down and you'd still have access to the registry's system to do this update.
Facebook does operate their own private registrar, since they operate tens of thousands of domains. Most of these are misspellings, country-specific domains, and so forth.
So yes, the registrar that is to blame is themselves.
Source: I know someone within the company that works in this capacity.
> That's also probably why we're seeing the domain "facebook.com" for sale on domain sites. The registrar that would normally provide the ownership info is down.
That’s not how it works. The info of whether a domain name is available is provided by the registry, not by the registrars. It’s usually done via a domain:check EPP command or via a DAS system. It’s very rare for registrar-to-registrar technical communication to occur.
Although the above is the clean way to do it, it’s common for registrars to just perform a dig on a domain name to check if it’s available because it’s faster and usually correct. In this case, it wasn’t.
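Roughly, that shortcut looks like the sketch below (dnspython assumed). The catch is that only NXDOMAIN tells you anything; a SERVFAIL or timeout, which is what facebook.com was returning during the outage, says nothing about availability.

    import dns.exception
    import dns.resolver

    def naive_availability_check(name):
        """The 'just dig it' heuristic some registrar frontends use."""
        try:
            dns.resolver.resolve(name, "NS")
            return "registered"
        except dns.resolver.NXDOMAIN:
            return "probably available"       # usually right, but not authoritative
        except (dns.resolver.NoAnswer,
                dns.resolver.NoNameservers,
                dns.exception.Timeout):
            return "unknown"                  # NOT the same thing as available

    print(naive_availability_check("facebook.com"))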
When the NS hostname is dependent on the domain it serves, "glue records" cover the resolution to the NS IP addresses. So there's no circular-dependency-type issue.
It's partially there. C and D are still not in the global tables according to RouteViews, i.e. 185.89.219.12 is still not being advertised to anyone. My peering sessions with them in Toronto have routes from them, but I'm not sure how far they're supposed to go inside their network (past hop 2 is them).
% traceroute -q1 -I a.ns.facebook.com
traceroute to a.ns.facebook.com (129.134.30.12), 64 hops max, 48 byte packets
1 torix-core1-10G (67.43.129.248) 0.133 ms
2 facebook-a.ip4.torontointernetxchange.net (206.108.35.2) 1.317 ms
3 157.240.43.214 (157.240.43.214) 1.209 ms
4 129.134.50.206 (129.134.50.206) 15.604 ms
5 129.134.98.134 (129.134.98.134) 21.716 ms
6 *
7 *
% traceroute6 -q1 -I a.ns.facebook.com
traceroute6 to a.ns.facebook.com (2a03:2880:f0fc:c:face:b00c:0:35) from 2607:f3e0:0:80::290, 64 hops max, 20 byte packets
1 toronto-torix-6 0.146 ms
2 facebook-a.ip6.torontointernetxchange.net 17.860 ms
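If you want to check advertisement status yourself without router access, RouteViews and RIPEstat expose it publicly. A rough sketch against RIPEstat's data API (the endpoint and field names here are my assumption of its current shape):

    import json
    import urllib.request

    def routing_status(resource):
        """Ask RIPEstat whether a prefix covering `resource` is announced in BGP."""
        url = ("https://stat.ripe.net/data/routing-status/data.json"
               f"?resource={resource}")
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)["data"]

    data = routing_status("185.89.219.12")
    print(data.get("announced"), data.get("resource"))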
»The Facebook outage has another major impact: lots of mobile apps constantly poll Facebook in the background, so everybody who runs large-scale DNS is being slammed, with knock-on impacts elsewhere the longer this goes on.«
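This is why client-side retry policy matters: an app that retries failed lookups with capped exponential backoff and jitter degrades gracefully, while one that retries in a tight loop turns an outage into a load test for everyone else's resolvers. A rough sketch of the idea:

    import random
    import socket
    import time

    def resolve_with_backoff(host, attempts=6, base=1.0, cap=300.0):
        """Retry DNS resolution with capped exponential backoff plus jitter."""
        for attempt in range(attempts):
            try:
                return socket.getaddrinfo(host, 443)
            except socket.gaierror:
                delay = random.uniform(0, min(cap, base * (2 ** attempt)))
                time.sleep(delay)
        return None  # give up quietly instead of hammering the resolver

    resolve_with_backoff("graph.facebook.com")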
You just need to get a large enough block so that you can throw most of it away by adding your own vanity part to the prefix you are given. IPv6 really isn't scarce so you can actually do that.
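For example, Facebook's 2a03:2880::/32 allocation (my reading of public RIR records) leaves 96 bits to play with, which is how an address like 2a03:2880:f0fc:c:face:b00c:0:35 can carry the "face:b00c" vanity part and still sit inside their own prefix:

    import ipaddress

    allocation = ipaddress.ip_network("2a03:2880::/32")   # assumed allocation
    vanity = ipaddress.ip_address("2a03:2880:f0fc:c:face:b00c:0:35")

    print(vanity in allocation)          # True
    print(128 - allocation.prefixlen)    # 96 bits left over for vanity hex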
My suspicion is that since a lot of internal comms runs through the FB domain and everyone is still WFH, it's probably a massive issue just to get people talking to each other to solve the problem.
I remember my first time having a meeting at Facebook and observing none of the doors had keyholes and thinking "hope their badge system never goes down"
> I remember my first time having a meeting at Facebook and observing none of the doors had keyholes and thinking "hope their badge system never goes down"
Every internet-connected physical system needs to have a sensible offline fallback mode. They should have had physical keys, or at least some kind of offline RFID validation (e.g. continue to validate the last N badges that had previously successfully validated).
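Something like that parenthetical fallback could be as simple as the sketch below (purely illustrative, not how any real access-control product works): each door controller keeps a small cache of badge IDs it has recently validated online and accepts those for a bounded window while the backend is unreachable.

    import time
    from collections import OrderedDict

    class OfflineBadgeCache:
        """Cache the last N badge IDs that validated successfully online."""

        def __init__(self, max_entries=500, max_age_s=72 * 3600):
            self._seen = OrderedDict()   # badge_id -> time of last online success
            self.max_entries = max_entries
            self.max_age_s = max_age_s

        def record_online_success(self, badge_id):
            self._seen[badge_id] = time.time()
            self._seen.move_to_end(badge_id)
            while len(self._seen) > self.max_entries:
                self._seen.popitem(last=False)   # evict the oldest entry

        def validate_offline(self, badge_id):
            ts = self._seen.get(badge_id)
            return ts is not None and (time.time() - ts) < self.max_age_s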
I have no doubt that the publicly published post-mortem report (if there even is one) will be heavily redacted in comparison to the internal-only version. But I very much want to see said hypothetical report anyway. This kind of infrastructural stuff fascinates me. And I would hope there would be some lessons in said report that even small time operators such as myself would do well to heed.
Around here we use Slack for primary communications, Google Hangouts (or Chat or whatever they call it now) as secondary, and we keep an on-call list with phone numbers in our main Git repo, so everyone has it checked out on their laptop, so if the SHTF, we can resort to voice and/or SMS.
I remembered to publish my cell phone's real number on the on-call list rather than just my Google Voice number since if Hangouts is down, Google Voice might be too.
We don't use tapes, everything we have is in the cloud, at a minimum everything is spread over multiple datacenters (AZ's in AWS parlance), important stuff is spread over multiple regions, or depending on the data, multiple cloud providers.
Last time I used tape, we used Ironmountain to haul the tapes 60 miles away which was determined to be far enough for seismic safety, but that was over a decade ago.
One of my employers once forced all the staff to use an internally developed messenger (for the sake of security, but some politics was involved as well), but made an exception for the devops team, who used Telegram.
Why? Even if it's not DNS reliance, if they self-hosted the server (very likely) then it'll be just as unreachable as everything else within their network at the moment.
I don't think it's cocky or 20/20 hindsight. Companies I've worked for specifically set up IRC in part because "our entire network is down, worldwide" can happen and you need a way to communicate.
My small org, with maybe 50 IPs/hosts we care about, still maintains a hosts file for those nodes' public and internal names. It's in Git, spread around, and we also keep our fingers crossed.
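For anyone tempted to copy this, the fallback is nothing fancier than a small host map kept under version control and rendered into /etc/hosts format on each machine. A sketch (names and addresses below are made up):

    # Minimal fallback host map kept in Git; addresses use documentation ranges.
    FALLBACK_HOSTS = {
        "git.example.internal": "203.0.113.10",
        "vpn.example.internal": "203.0.113.11",
        "chat.example.internal": "203.0.113.12",
        "monitoring.example.com": "198.51.100.20",
    }

    def to_hosts_file(mapping):
        """Render the map as /etc/hosts lines."""
        return "\n".join(f"{ip}\t{name}" for name, ip in mapping.items()) + "\n"

    print(to_hosts_file(FALLBACK_HOSTS))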
If only IRC had been built with multi-server setups in mind, forwarding messages between servers and continuing to work if a single server, or even a set of servers, went down, just resulting in a netsplit... Oh wait, it was!
My bet is, FB will reach out to others in FAMANG, and an interest group will form maintaining such an emergency infrastructure comm network. Basically a network for network engineers. Because media (and shareholders) will soon ask Microsoft and Google what their plans for such situations are. I'm very glad FB is not in the cloud business...
> If only IRC had been built with multi-server setups in mind, forwarding messages between servers and continuing to work if a single server, or even a set of servers, went down, just resulting in a netsplit... Oh wait, it was!
yeah if only Facebook's production engineering team had hired a team of full time IRCops for their emergency fallback network...
Considering how much IRCops were paid back in the day (mostly zero, as they were volunteers) and what a single senior engineer at FB makes, I'm sure you will find 3-4 people spread around the world willing to share a 250k+ salary among them.
I worked on the identity system that chat (whatever the current name is) and gmail depend on and we used IRC since if we relied on the system we support we wouldn’t be able to fix it.
Word is that the last time Google had a failure involving a cyclical dependency they had to rip open a safe. It contained the backup password to the system that stored the safe combination.
The safe in question contained a smartcard required to boot an HSM. The safe combination was stored in a secret manager that depended on that HSM.
The engineer attempted to restart the service, but did not know that a restart required a hardware security module (HSM) smart card. These smart cards were stored in multiple safes in different Google offices across the globe, but not in New York City, where the on-call engineer was located. When the service failed to restart, the engineer contacted a colleague in Australia to retrieve a smart card. To their great dismay, the engineer in Australia could not open the safe because the combination was stored in the now-offline password manager.
Safes typically have the instructions on how to change the combination glued to the inside of the door, and ending with something like "store the combination securely. Not inside the safe!"
But as they say: make something foolproof and nature will create a better fool.
Anyone remember the 90s? There was this thing called the Information Superhighway, a kind of decentralised network of networks that was designed to allow robust communications without a single point of failure. I wonder what happened to that...?
We are a dying breed... A few days ago my daughter asked me "will you send me the file on Whatsapp or Discord?". I replied I will send an email. She went "oh, you mean on Gmail?" :-D
I unfortunately cannot edit the parent comment anymore but several people pointed out that I didn't back up my claim or provided any credentials so here they are:
Google has multiple independent procedures for coordination during disasters. A global DNS outage (mentioned in https://news.ycombinator.com/item?id=28751140) was considered and has been taken into account.
I do not attempt to hide my identity here, quite the opposite: my HN profile contains my real name. Until recently a part of my job was to ensure that Google is prepared for various disastrous scenarios and that Googlers can coordinate the response independently of Google's infrastructure. I authored one of the fallback communication procedures that would likely be exercised today if Google's network experienced a global outage. Of course Google has a whole team of fantastic human beings who are deeply involved in disaster preparedness (miss you!). I am pretty sure they are going to analyze what happened to Facebook today in light of Google's emergency plans.
While this topic is really fascinating, I am unfortunately not at liberty to disclose the details as they belong to my previous employer. But when I stumble upon factually incorrect comments on HN that I am in a position to correct, why not do that?
Interesting that you are asking for the dirt given that DiRT stands for Disaster and Recovery Testing, at least at Google.
Every year there is a DiRT week where hundreds of tests are run. That obviously requires a ton of planning that starts well in advance. The objective is, of course, that despite all the testing nobody outside Google notices anything special. Given the volume and intrusiveness of these tests, the DiRT team is doing quite an impressive job.
While the DiRT week is the most intense testing period, disaster preparedness is not limited to just one event per year. There are also plenty of tests conducted throughout the year, some planned centrally, some done by individual teams. That's in addition to the regular training and exercises that SRE teams do periodically.
Because, it may shock you to know, but sometimes people just go on the Internet and tell lies.
No shit Google has plans in place for outages.
But what are these plans, are they any good... a respected industry figure whose CV includes being at Google for 10 years doesn't need to go into detail describing the IRC fallback to be believed and trusted that there is such a thing.
I found a comment that was factually incorrect and I felt competent to comment on that. Regrettably, I wrote just one sentence and clicked reply without providing any credentials to back up my claim. Not that I try to hide my identity, as danhak pointed out in https://news.ycombinator.com/item?id=28751644, my full name and URL of my personal website are only a click away.
I've read here on HN that exactly this was the issue during one of their bigger outages (I think it was due to some auth service failure), when Gmail didn't accept incoming mail.
I think the issue there is that in exchange for solving the "one fat finger = outage" problem, you lose the ability to update the server fleet quickly or consistently.
The rate at which some Amazon services have lately gone down because other AWS services went down proves that this is an unsustainable house of cards anyway.
Sheera Frenkel
@sheeraf
Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.
From the Tweet, "Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors."
> Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.
Disclose.tv
@disclosetv
JUST IN - Facebook employees reportedly can't enter buildings to evaluate the Internet outage because their door access badges weren’t working (NYT)
Oh I'm sure everyone knows what's wrong, but how am I supposed to send an email, find a coworker's phone number, get the crisis team on video chat, etc., if all of those connections rely on the facebook domain existing?
Hence the suggestion for PagerDuty. It handles all this because responders set their notification methods (phone, SMS, e-mail, and app) in their profiles, so when you're in trouble nobody has to ask those questions; you just add a person as a responder to the incident.
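For reference, wiring an external escalation path like that is a small amount of glue. A rough sketch against PagerDuty's Events API v2 (the routing key and details below are placeholders, not real values):

    import json
    import urllib.request

    def page(routing_key, summary, source, severity="critical"):
        """Trigger a PagerDuty incident via the Events API v2."""
        body = {
            "routing_key": routing_key,    # placeholder integration key
            "event_action": "trigger",
            "payload": {"summary": summary, "source": source, "severity": severity},
        }
        req = urllib.request.Request(
            "https://events.pagerduty.com/v2/enqueue",
            data=json.dumps(body).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            return json.load(resp)

    # page("R0UT1NGKEYPLACEHOLDER", "facebook.com does not resolve", "external-probe")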
Yes, but Facebook is not a small company. Could PagerDuty realistically handle the scale of notifications that would be required for Facebook's operations?
PagerDuty does not solve some of the problems you would have at FB's scale, like how do you even know who to contact? And how do they log in once they know there is a problem?
The place where I worked had failure trees for every critical app and service. The goal for incident management was to triage and have an initial escalation for the right group within 15 minutes. When I left they were like 96% on target overall and 100% for infrastructure.
Even if it can’t, it’s trivial to use it for an important subset, i.e. is facebook.com down, is the NS stuff down, etc. So there is an argument to be made for still using an outside service as a fallback.
- not arrogant
- or complacent
- haven't inadvertently acquired the company
- know your tech peers well enough to have confidence in their identity during an emergency
- do regular drills to simulate everything going wrong at once
Lots of us know what should be happening right now, but think back to the many situations we've all experienced where fallback systems turned into a nightmarish war story, then scale it up by 1000. This is a historic day, I think it's quite likely that the scale of the outage will lead to the breakup of the company because it's the Big One that people have been warning about for years.
I guarantee you that every single person at Facebook who can do anything at all about this, already knows there's an issue. What would them receiving an extra notification help with?
We kind of got off topic. I was arguing that if you were concerned about internal systems being down (including your monitoring/alerting), something like PagerDuty would be fine as a backup. Even at huge scale that backup doesn’t need to watch everything.
I don’t think it’s particularly relevant to this issue with fb. I suspect they didn’t need a monitoring system to know things were going badly.
I can imagine this affects many other sites that use FB for authentication and tracking.
If people pay proper attention to it, this is not just an average run-of-the-mill "site outage", and instead of checking on or worrying about backups of my FB data (thank goodness I can afford to lose it all), I'm making popcorn...
Hopefully law makers all study up and pay close attention.
What transpires next may prove to be very interesting.
NYT tech reporter Sheera Frenkel gives us this update:
>Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.
I just got off a short pre-interview conversation with a manager at Instagram and he had to dial in with POTS. I got the impression that things are very broken internally.
How much of modern POTS is reliant on VOIP? In Australia at least, POTS has been decommissioned entirely, but even where it's still running, I'm wondering where IP takes over?
This person has a POTS line in their current location, and a modem, and the software stack to use it, and Instagram has POTS lines and modems and software that connect to their networks? Wow. How well do Instagram and their internal applications work over 56K?
The voices, stories, announcements, photos, hopes and sorrows of millions, no, literally billions of people, and the promise that they may one day be seen and heard again now rests in the hands of Dave, the one guy who is closest to a Microcenter, owns his own car and knows how to beat the rush hour traffic and has the good sense to not forget to also buy an RS-232 cable, since those things tend to get finicky.
Yeah the patch to fix BGP to reach the DNS is sent by email to @facebook.com. Ooops no DNS to resolve the MX records to send the patch to fix the BGP routers.
No. A network like Facebook's is vast and complicated and managed by higher-level configuration systems, not people emailing patches around.
If this issue even has to do with BGP, it's much more likely the root of the problem is somewhere in this configuration system and that fixing it is compounded by some other issues that nobody foresaw. Huge events like this are always a perfect storm of several factors, any one or two of which would be a total noop alone.
On the other hand, I and my office mate at the time negotiated the setup of a ridiculous number of BGP sessions over email, including sending configs. That was 20 years ago.
I don't know, I doubt it. It's just funny to think that you need email to fix BGP, but DNS is down because of BGP, and you need DNS to send email, which needs BGP. It's a kind of chicken-and-egg problem, but at a massive scale this time.
Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.
You'd think they'd have worked that into their DR plans for a complete P1 outage of the domain/DNS, but perhaps not, or at least they didn't add removal of BGP announcements to the mix.
I would have expected a DNS issue to not affect either of these.
I can understand the onion site being down if Facebook implemented it the way a third party would (a proxy server accessing facebook.com) instead of actually integrating it into its infrastructure as a first-class citizen.
You can get through to a web server, but that web server uses DNS records or those routes to hit other services necessary to render the page. So the server you hit will also time out eventually and return a 500
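That cascade is easy to picture as code: the frontend handler itself is reachable, but rendering needs an internal service whose name no longer resolves, so the request hangs on the lookup and eventually comes back as a 500 (hypothetical names, purely to illustrate the failure mode):

    import socket
    import urllib.error
    import urllib.request

    def render_feed():
        """Frontend handler that depends on an internal service by DNS name."""
        try:
            with urllib.request.urlopen(
                "https://feed-api.internal.example.com/v1/feed", timeout=5
            ) as upstream:
                return 200, upstream.read()
        except (urllib.error.URLError, socket.timeout) as exc:
            # Name resolution or routing to the backend failed: surface a 500.
            return 500, f"upstream unavailable: {exc}".encode()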
The issue here is that this outage was a result of all the routes into their data centers being cut off (seemingly from the inside). So knowing that one of the servers in there is at IP address "1.2.3.4" doesn't help, because no-one on the outside even knows how to send a packet to that server anymore.